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Abstract 

We introduce a new regression framework, Gaussian process regression networks (GPRN), 
which combines the structural properties of Bayesian neural networks with the nonparametric 
flexibility of Gaussian processes. This model accommodates input dependent signal and noise 
correlations between multiple response variables, input dependent length-scales and amplitudes, 
and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo 
and variational Bayes inference procedures for this model. We apply GPRN as a multiple output 
regression and multivariate volatility model, demonstrating substantially improved performance 
over eight popular multiple output (multi-task) Gaussian process models and three multivariate 
volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset. 



1 Introduction 



Gaussian process models have become exceptionally popular for solving non-linear regression and classi- 
fication problems. They are expressive, interpretable, avoid over-fitting, and have impressive predictive 



performance in many thorough empirical comparisons ( Rasmussen 1996 Kuss and Rasmussen 2005 Ras 



mussen and Williams , 2006 ) 



In machine learning, Gaussian process regression developed out of neural networks research. Neal 



( 1996 1 showed that Bayesian neural networks became Gaussian processes as the number of hidden units 



approached infinity, and conjectured that "there may be simpler ways to do inference in this case." These 
simple inference techniques became the cornerstone of subsequent Gaussian process models. However, 
neural networks had been motivated in part by their ability to capture correlations between multiple 
outputs (responses), by using adaptive hidden units that were shared between the outputs. In the infinite 
limit, this ability was lost. 

Recently there has been an explosion of interest in extending the Gaussian process regression framework 
to account for fixed correlations between output variables ( Alvarez and Lawrence 2011[ Yu et al 



2008 Osborne et al. |2008i Teh et al.[ 2005| Boyle and Frean 



2009 



Alvarez and Lawrence[ 2008 Bonilla et al. 
20041. These are often called 'multi-task' learning or 'multiple output' regression models. Capturing 
correlations between outputs (response variables) can be used to make better predictions. Imagine we 
wish to predict cadmium concentrations in a region of the Swiss Jura, where geologists are interested 
in heavy metal concentrations. A standard Gaussian process regression model would only be able to 
use cadmium training measurements. With a multi-task method, we can also make use of correlated 
heavy metal measurements to enhance cadmium predictions (Goovaerts 1997). We could further enhance 



predictions if we make use of how these (signal) correlations change with geographical location. 



*http://mlg.eng.cam.ac.uk/ andrew 
thttp:/ /mlg. eng. cam. ac.uk/dave 
http: / /mlg. eng. cam. ac.uk/zoubin 



There has similarly been great interest in extending Gaussian process (GP) regression to account 

Kersting et al. 2007[ Adams and Stegle 



for input dependent noise variances (Goldberg et al. 



Titsias 



1998 



2008] [Corner and Sahani[ [2008) |Turne r, 20113; [Wilson and Ghahramani[ |20lba|b[ |Llzaro-Gredilla and 



2011). Wilson and Ghahramani, ( |2010b 2011) and Fox and Dunson (2011) further extended the 



GP framework to accommodate input dependent noise correlations between multiple output (response) 
variables. 

Other extensions include Gaussian process regression with non-stationary covariance function ampli- 



tudes ( 


Turner and Sahani 


2008 


Adams and Stegle 


20081 and length-scales 


Gibbs 


1997 


Schmidt and 


O'Hagan 


2003 


), and with heavy tailed predictive distributions ( 


Neal 


1997 


Vanhatalo et al. 


2009 


) for 



outlier rejection ( |De Finetti[ |1956[ |Dawid[ |1973[ |0'Hagan[ |1979[ ). 

In this paper, we introduce a new regression framework, Gaussian Process Regression Networks (GPRN), 



which combines the structural properties of Bayesian neural networks (Neal 1996) with the nonparamet- 
ric flexibility of Gaussian processes. This network is an adaptive mixture of Gaussian processes, which 
naturally accommodates input dependent signal and noise correlations between multiple output variables, 
input dependent length-scales and amplitudes, and heavy tailed predictive distributions, without expensive 
or numerically unstable computations. 

We start by introducing the GPRN framework, and show how to perform efficient inference using both 



Markov chain Monte Carlo (MCMC) and variational Bayes (VB). Carefully following Alvarez and Lawrence 



(2011 ), we compare to eight multiple output GP models on gene expression and geostatistics datasets. We 
then compare to multivariate volatility models on several benchmark financial datasets, following ^Wilson] 



and Ghahramani (2010b). In the Appendix, we review Gaussian process regression and the notation of 



Rasmussen and Williams ( 2006 ) . 



2 Gaussian Process Regression Networks 

Given vector valued data points 2? = {2/(a^i) : i = 1, . . . , N}, where a; G A" is an arbitrary input variable, we 
aim to predict E[y(a;*)|a;*, P] and cov[y(a;*)|a;*, 2?] at a test input x^. We assume that the noise and signal 
correlations between the elements of y{x) may change as a function of x. Supposing x is time {x = t), a 
particular component of the p-dimensional vector y{t) could be the expression level of a particular gene 
at time t, and "Sit) would then represent the variances and correlations for these genes at time t. Instead 
of assuming only time dependence, we could have a more general input (predictor) variable x & X . Then 
we can imagine y{x) representing different heavy metal concentrations at a geographical location x G M^. 
These are two examples from the experiments in Section |4] 
We model y{x) as 

yix)^W{x)[fix)+afe]+ayZ (1) 

where e and z are i.i.d. J\f{0,I) white noiscj^ W{x) is a p x q matrix of independent Gaussian processes 
such that W{x)ij ^ 0^(0, k^), and f{x) = {fi{x), . . . , fq{x))^ is a q x 1 vector of independent GPs with 
fi{x) ~ GViO, kf.). Each of the latent Gaussian processes in f{x) have additive Gaussian noise. Changing 
variables to include the noise cr/e we let fi{x) ~ 01^(0, kj:.), where 

kj^{x,x') ^ kf^{x,x') + crj6^^> , (2) 

and 6xx' is the Kronecker delta. 

We represent this Gaussian process regression network (GPRN) in Figure [I] labelling the length-scale 
hyperparameters for the kernels k^ and kf as 6^ and Of respectively. We see the latent node functions 
f{x) are connected together to form the outputs y{x). The strengths of the connections change as a 
function of x; the weights themselves - the entries of W{x) - are functions. Old connections can break and 

^The distribution of z could be Student-t, Laplace, or something different. Using diagonal noise would also be a straight- 
forward extension. 
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a) 



b) 



Figure 1: The Gaussian process regression network. Latent random variables and observables are respectively 
labelled with circles and squares, except for the weight functions in a). Hyperparameters are labelled with dots, 
a) This neural network style diagram shows the q components of the vector / (GPs with additive noise), and the p 
components of the vector y. The links in the graph, four of which are labelled, are latent random weight functions. 
Every quantity in this graph depends on the input x. This graph emphasises the adaptive nature of this network: 
links can change strength or even disappear as x changes, b) A directed graphical model showing the generative 
procedure with relevant variables. 



new connections can form. This is an adaptive network, where the signal and noise correlations between 
the components of y{x) vary with x. 

To expUcitly separate the dynamic signal and noise correlations, we re-write ([T]) as 

y{x) ^ W{x ) f{x) + afW{x)e + OyZ^ . (3) 

signal noise 

Conditioning on W{x) in (|3|, we can better understand the signal correlations. In this case, each of 
the outputs yi{x), i = 1, . . . ,p, is a Gaussian process with kernel 

ky^{x,x')=J2w.,ix)[kf.ix,x') + a}]W,,{x') + al. (4) 

Even if a'j and ay are zero, so that this is a noise free regression, there are still signal correlations; the 
components of y are coupled through the matrix W{x). Once the network has been trained, W{x) is 
conditioned on the data V, and so the predictive covariances of y{x^)\'D are now influenced by the values 
of the observations themselves, and not just distances between the test point a;* and the observed points 
Xi,. . . ,xm as is the case for independent GPs; we can view Q as an adaptive kernel learned from the 
data. There are three other interesting features in equation Q: 1) the amplitude of the covariance function 
J2'j=i is non-stationary (input dependent); 2) even if each of the kernels kf. has different 

stationary length-scales, the mixture of the kernels fc/ is input dependent and so the effective overall 
length-scale is non-stationary; 3) the kernels fc/^. may be entirely different: some may be periodic, others 
squared exponential, others Brownian motion, etc. . So the overall covariance function (kernel) may be 
continuously switching between regions of entirely different covariance structures. 

^ Coincidentally, there is an unrelated paper called "Gaussian process networks" ( [Friedman and Nachma"ii| |2000| , which 
is about learning the structure of Bayesian networks - e.g. the direction of dependence between random variables. 



In addition to modelling signal correlations, we can see from equation ([s]) that the GPRN is simulta- 
usly a multivariate volatility model. The noise covariance is (j'jW {x)W {x)^ + cr^ J. Since the entries of 
W{x) are GPs, this noise model is an example of a generalised Wishart process ( [Wilson and Ghahramani 



2010b 2011) 



The number of nodes q influences how the model accounts for signal and noise correlations. If 9 is smaller 
than p, the dimension of y{x), the model performs dimensionality reduction and matrix factorization as 
part of the regression on y{x) and cov[y(a;)]. However, we may want q > p, for instance if the output space 
were one dimensional {p ~ I). In this case we would need g > 1 to realise features 2 and 3 listed above. 
For a given dataset, we can vary q and select the value which gives the highest marginal likelihood on 
training data. We can also use 'automatic relevance determination' (MacKay and Neal 1994) as a proxy 
for model selection for q for a given dataset. 



each node function j , so that k 



This is achieved by introducing {a^}, signal variances for 
and comparing magnitudes of the trained aj. 



When q = p = 1, the GPRN essentially becomes the nonstationary GP regression model of Adams and 



Stegle ( 2008 ) and Turner and Sahani ( 2008 ) . Likewise, when the weight functions are constants the GPRN 



becomes the Semiparametric Latent Factor Model (SLFM) of Teh et al. ( 2005 ) , except that the resulting 
GP regression network is less prone to over-fitting through its use of full Bayesian inference]^ Indeed, in 
an implementation of GPRN, one can switch features on or off; to switch off changing correlations and 
multivariate volatility, set cr^ = and the length-scales for weight function kernels {k^) to large fixed 
values. 



3 Inference 

Now that we have specified a prior at all points x in our domain X ^ we wish to predict E[y(a;^)|2;^, 2?] 

and cov[t/(a;*)|a;*,X'] at a test input x*, given vector valued data V — {y{xi) : i — 1,...,N}. We do 
this using two different approaches - variational Bayes and Markov chain Monte Carlo (MCMC) - and 
we compare between these approaches. We also use variational Bayes to estimate all hyperparameters 
7 = {0/, (T/, tjj,}, where, as before. Of and 0^, are the length-scales of the node and weight function 
kernels. 

As a first step, we re- write the prior in terms of m = (f,W), a vector composed of all the node and 
weight Gaussian process functions, evaluated at the training points {xi, . . . ,xn}- There will be q node 
functions and p x q weight functions. Therefore 

piu\cTf,ef,e^)^M{o,CB), (5) 

where Cb is an Nq{p + 1) x Nq{p + 1) block diagonal matrix, since the weight and node functions are 
independent in the prior. The way we have ordered u, the first q blocks are N x N covariance matrices 
from the node kernel kj, and the last blocks are N x N covariance matrices from the weight kernel 

k 

Next we specify our likelihood function, so we can use Bayes' theorem to find the posterior p{u\'D,j). 
From ([I]), our likelihood is 

N 

p{V\u,ay)^l[U{y{x,)■,W{x,)f{^^),<ylI)■ (6) 



i=l 



By incorporating noise on /, the GP network accounts for multivariate volatility (as in ([3|)), without the 
need for costly or numerically unstable matrix inversions. For other multivariate volatility models, like 
multivariate GARCH ( Bollerslev et al.[|l988J , or multivariate stochastic volatility (Harvey et al. 1994^), the 



likelihood takes the form p{'D\f3) = Y[i=i-^it^i''^i)^ ^^"^ requires inversions of p x p covariance matrices. 
There are three other notable advantages to the inference with GPRN: 1) it is easy to simultaneously 
estimate fii and S^. Usually in the multivariate volatility setting, fii is assumed to be a constant; 2) we 

3ln |Teh et al.| l |2005[ l, the weight constants are a large matrix of hyperparameters, determined through maximising a 
marginal hkelihood. 
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can use a Student-t observation model instead of a Gaussian observation model, by letting z in ([!]) be i 
distributed, with minimal changes to the inference procedures; 3) we can transform the comp onents of the 
product W{xi)f(xi) so that the priors on the components of y{xi) become copula processes (Wilson and 
Ghahramani 2010a) and have whatever marginals we desire. We can also do this without significantly 



changing inference procedures. 

Now that we have specified our prior and likelihood, we can apply Bayes' theorem: 

p(m|I?,7) oc p{T)\u,ay)p{u\0f,O^,af) . 



(7) 



In the next sections, we discuss how to either sample from or use variational Bayes to approximate this 
posterior in ([7|), so that we can estimate p{y{x^)\'D). We also use variational Bayes to learn the hyperpa- 
rameters 7. 

3.1 Markov chain Monte Carlo 

To sample from ([7|, we could use a Gibbs sampling scheme which would have conjugate posterior updates, 
alternately conditioning on weight and node functions. However, this Gibbs cycle would mix poorly because 
of the tight correlations between the weights and the nodes. In general, MCMC samples from ([T]) mix 
poorly because of the strong correlations in the prior imposed by Cb- The sampling process is also often 
slowed by costly matrix inversions in the likelihood. 



We use Elliptical Shce Samphng (Murray et al. 20101, a recent MCMC technique specifically designed 



to sample from posteriors with tightly correlated Gaussian priors. It does joint updates and has no free 
parameters. We find that it mixes well. And since there are no costly or numerically unstable matrix 
inversions in the likelihood of ([6| we also find sampling to be highly efficient. 

With a sample from ([t]), we can sample from the predictive p{W{Xi,), f{x^,)\u,af,'D). Let W^,fl be 
the z**^ such joint sample. Using ([s]) we can then construct samples of W*, fl, aj, ay), from which 

we can construct the predictive distribution 



1 ^ 

p{yix^)\V) = lim ^ Vp(y(2:,)|H/;,/:,cr/,crj^) 

J— >-oo J ^ — ' 
i—1 



(8) 



We see that even with a Gaussian observation model, the predictive distribution in ([8| is an infinite mixture 
of Gaussians, and will generally be heavy tailed and therefore robust to outliers. 

Mixing was assessed by looking at trace plots of samples, and the likelihoods of these samples. Specific 
information about how long it takes to sample a solution for a given problem is in the experiments section. 



3.2 Variational Bayes 



We perform variational EM ( [Jordan et al.[ 1999 ) to fit an approximate posterior q to the true posterior p, 
by minimising the KuUback-Leibler divergence KL{q\\p) — —H[q{v)] — J q{\-) logp(v)(iv, where H[q{v)] = 
~ J g(v) \ogq{-v)d-v is the entropy and v = {f , W, o-^,cry, aj}. 



E-step. We use Variat ional Message Passing (Winn and Bishop 2006) under the Infer.NET frame- 
work (|MinkaetaL, 20101 to estimate the posterior over v = {f , W, a'j,ay,aj}. We specify inverse Gamma 
priors on {aj,ay,ajy: 

ajj ^ IG(q!^2,/3^2), cr^ ^ IG(a^2,/3<^2), aj lG{aa,Pa)- 

For mathematical and computational convenience we introduce the following variables which are deter- 
ministic functions of the existing variables in the model: 



'^nij fl 



mj J nj J 



nj 
Sin 



— ^ tm 



(9) 
(10) 
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Note that the observations yi{xn) ~ A/'(si„,(Ty) and that /„j ~ ■^{fnj''^f )■ Variational message passing 
uses these deterministic factors and the associated "pseudo-marginals" as conduits to pass appropriate 



moments, resulting in the same updates as standard VB (Winn and Bishop 2006). The full model can 
now be written as 



p(v) cx IGial ; a,. , ) J] AA(f, ; 0, a,Kf^ ] 



p 



N 



lG{ajj; a„2, /3^2)lG{aj; aa, f3a)Y[ 

i=l . 

T 

n <5K,, - W^.ix^mr^^ - /,(a;„))A/-(/„,;/;,,^/; 

n=l 

j 

We use a variational posterior of the following form: 

Q P 

3=1 i=l 

N 

n 1w„,j (/ni)'?/„, ifnj)qt„,j (imj)gs„ (S^„) 



where q„2,q^j and are inverse Gamma distributions; 9io„ij , 9/' . , 9j .i9t„i3 ^-nd gs-^ are univariate 
normal distributions; and gf and 9Wi (W^) are multivariate normal distributions. 

The updates for f , W, aj, ay are standard VB updates and are available in Infer.NET. The update for 
the ARD parameters aj however required specific implementation. The factor itself is 



logAA(f,;0,a,A7) ^ --\og\a,K,\ - -iJ{a,Kf)-H, 
= -y loga, - \ log \K,\ 



(11) 



where = denotes equality up to an additive constant. Taking expectations with respect to f under q we 
obtain the VMP message to aj as being IG ^a^; — 1, \ {^J ^'^J^^j)^ ■ Since the variational posterior on f 
is multivariate normal the expectation {fjKj^fj) is straightforward to calculate. 



M-step. In the M-step we optimise the variational lower bound with respect to the log length scale 
parameters {6*/, 0^}, using gradient descent with line search. When optimising Of we only need to consider 
the contribution to the lower bound of the factor M{ij;0,ajKf.) (sec (111), which is straightforward to 
evaluate and differentiate (see Appendix). For 9^ we consider the contribution oi J^{Wpq;0,Kw). 



3.3 Computational Considerations 

GPRN is mainly limited by taking the Cholesky decomposition of the block diagonal Cb, an Nq{p + 1) x 
Nq{p + 1) matrix. But pq of these blocks are the same N x N covariance matrix K^, for the weight 
functions, and q of these blocks are the covariance matrices Kj associated with the node functions, and 
chol(blkdiag(A, i3, . . . )) = blkdiag(chol(A), chol(i3), . . . ). Therefore assuming the node functions share 
the same covariance function (which they do in our experiments), the complexity of this operation is only 
0{N^), the same as for regular Gaussian process regression. At worst it is 0{qN^), assuming different 
covariance functions for each node. 
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Sampling also requires likelihood evaluations. Since there are input dependent noise correlations be- 
tween the elements of the p dimensional observations y{xi), multivariate volatility models would normally 
require inverting apxp cov ariance matrix N tim es, like MGARCH (Bollerslev et al. 1988 ) or multivariate 
stochastic volatility models (Harvey et al. 1994). This would lead to a total complexity of 0{Nqp + Np^). 



However, by working directly with the noisy / instead of the noise free /, evaluating the likelihood requires 
no costly or numerically unstable inversions, and thus has a complexity of only 0{Nqp). This allows GPRN 
to scale to high dimensions; indeed we have a 1000 dimensional gene expression experiment in Section |4j 

The computational complexity of VB is dominated by the 0{N^) inversions required to calculate the 
covariance of the node and weight functions in the E-step. Naively q and qp such inversions are required per 
iteration for the node and weight functions respectively, giving a total complexity of 0{qpN^). However, 
under VB the covariances of the weight functions for the same p are all equal, reducing the complexity to 
0{qN^). lip is large the 0{pqN'^) cost of calculating the weight function means may become significant. 
Although the per iteration cost of VB is actually higher than for MCMC far fewer iterations are typically 
required to reach convergence. 

Overall, even though the GPRN accounts for input dependent signal correlations (rather than fixing 
the correlations like other multi-task methods) , the computational demands of GPRN compare favourably 



to most multi-task GP models, which commonly have a complexity of 0{p'^N^) (Alvarez and Lawrence 



4 Experiments 



We compare the GPRN to multi-task learning and multivariate volatility models, and we also use the 
GPRN to gain new scientific insights into the data we model. Furthermore, we compare between variational 
Bayes (VB) and Markov chain Monte Carlo (MCMC) inference within the GPRN framework. To keep our 
comparisons up to date, we exactly reproduce many of the experiments in recent papers by [Alvarez and| 
Lawrence ( |2011 ) and Wilson and Ghahramani (2010b) on benchmark datasets. In the multi-task setting, 
there are p dimensional observations y(x), and the goal is to use the correlations between the elements 
of y{x) to make better predictions of y(x*), for a test input x*, than if we were to treat the dimensions 
independently. A major difference between GPRN and alternative multi-task models is that the GPRN 
accounts for signal correlations that change with x, rather than fixed correlations. It also accounts for 
changing noise correlations (multivariate volatility). 

We compare to the following multi-task GP methods: 1) the linear model of coregionalisation (LMC) 
( Journel and Huijbregts 1978| Goovaerts{|1997 ), 2) the intrinsic coregionahsation model (ICM) ( [Goovaertsj 



19971, 3) ordinary co-kriging (Cressie 1993| Goovaerts 1997 Wackernagel[ [2003 ) , 4) the semiparametric 
latent factor model (S LFM) (|Teh et al. 2005), 5) convolved multiple output Gaussian processes (CMOGP) 
( Barry and Jay[ 1996[ Ver Hoef and Barry 1998| [Boyle and Frean 2004), 6) standard independent Gaussian 



processes (GP), 7) and the DTC ( Csato and Opper{ "2001 Seeger et al. 2003 Quinonero-Candela and 



Rasmussen[[2005[[Rasmussen and Williams[ [2006[ ) , 8) FITC ( [Snelson and Ghahramani[[2006[), and 9) PITC 
(Quinonero-Candela and Rasmussen 2005) sparse approximations for CMOGP (Alvarez and Lawrence 
2011[), which we respectively lab el as MDTC, MFITC and MPITC. Detail about each of these methods is 
in Alvarez and Lawrence (2011). We compare on a 3 dimensional geostatistics heavy metal dataset from 
the Swiss Jura, where 28% of the observations for one of the outputs (response variables) is missing, and 
on gene expression datasets with 50 and 1000 dimensional time dependent outputs y{t). 

In the multi-task experiments, the GPRN accounts for input dependent noise covariance matrices 
cav\y{x)] = E(a;). To specifically test GPRN's ability to model input dependent noise covariances (mul- 
tivariate volatility), we also compare predictions of S(a;) to those made by popular multivariate volatility 
models - full BEKK MGARCH (Engle and Kron"ei] 19951, generalised Wishart processes (Wilson and 



Ghahramani 2010b), the original Wishart process (Bru 1991 Gourieroux et al. 2009), and empirical esti 



mates - on benchmark return series datasets which are especially suited to MGARCH 


( Poon and Granger 


2005 


Hansen and Lunde 


2005 


Brownlees et al. 


2009 


McCuUough and Renfro 


1998 


Brooks et al. 


2001) 



In all experiments, GPRN uses a squared exponential covariance function for its node functions, and 
another squared exponential covariance function for its weight functions. 
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4.1 Gene Expression 



Tomancak et al.] ( |2002[ ) measured gene expression levels every hour for 12 hours during Drosophila embryo- 
genesis; they then repeated this experiment for an independent replica (a second independent time series). 
Gene expression is activated and deactivated by transcription factor proteins. We focus on genes which are 
thought to at least be regulated by the transcription factor twi, which influences mesoderm and muscle 



development in Drosophila (Zinzen et al. 2009). The assumption is that these gene expression levels are 
all correlated. We would like to use how these correlations change over time to make better predictions 
of time varying gene expression in the presence of transcription factors. In total there are 1621 genes 
(outputs) at A'^ = 12 time points (inputs), on two independent replicas. For training, p = 50 random genes 
were selected from the first replica, and the corresponding 50 genes in the second replica were used for 
testing. We then repeated this experiment 10 times with a different set of genes each time, and averaged 
the results. We then repeated the whole experiment, but with p = 1000 genes. We used exactly the same 



training and testing sets as Alvarez and Lawrence (2011 1 



We use a relatively small p = 50 dataset so that we are able to compare with popular alternative 
multi-task methods (LMC, CMOGP, SLFM) which have a complexity of ©(iV^p^) and would not scale to 
p — 1000 (Alvarez and Lawrence 2011 ). For p = 1000, we compare to the sparse convolved multiple output 
GP methods (MFITC, MDTC, and MPITC) of [Alvarez and Lawrence] ( |2011[ ). In both of these regressions, 
the GPRN is accounting for multivariate volatility; this is the first time a multivariate stochastic volatility 
model has been estimated for p > 50 ( |Chib et"aL 2006 ) . We assess performance using standardised mean 



square error (SMSE) and mean standardized log loss (MSLL), as defined in Rasmussen and Williams 



(2006) on page 23, and discussed in Alvarez and Lawrence (2011 1 on page 1469. Using the empirical mean 



and variance to fit the data would give an SMSE and MSLL of 1 and respectively. The smaller the SMSE 
and more negative the MSLL the better. 

The results are in Table [l] under the hea dings GENE (SOD) and GENE (l OOOD). For SET 1 of the 50D 
dataset, we used Rephca 1 and Replica 2 in Alvarez and Lawrence (2011) respectively as training and 
testing replicas. We follow [Alvarez and Lawrence ( 2011[ ) and reverse training and testing replicas to create 
SET 2. T he resu lts for LMC, CMOGP, MFITC, MPITC, and MDTC are reproduced from [Xlwezand] 



Lawrence (2011). GPRN significantly outperforms all of the other models, with between 46% and 68% 
of the SMSE, and similarly strong results on the MSLL error metric. On the 50D dataset the MCMC 
and VB results are comparable. However, on the lOOOD dataset GPRN with VB noticeably outperforms 
GPRN with MCMC, likely because MCMC is not mixing as well in high dimensions. Indeed VB may have 
an advantage over MCMC in high dimensions. On the other hand, GPRN with MCMC is still robust, 
outperforming all the other methods on the 1000 dimensional dataset. 

On both the 50 and 1000 dimensional datasets, the marginal likelihood for the network structure is 
sharply peaked at g = 1. This is evidence for the hypothesis that there is only the one transcription 
factor twi controlling the expression levels of the genes in question. These datasets can be found in Neil 
Lawrence's GPSIM toolbox: http://staffwww.dcs.shef.ac.Uk/people/N.Lawrence/gpsim/ 

Typical GPRN (VB) runtimes for the 50D and lOOOD datasets were respectively 12 seconds and 330 
seconds. 



4.2 Jura Geostatistics 

Here we are interested in predicting concentrations of cadmium at 100 locations within a 14.5 km^ region 
of the Swiss Jura. For training, we have access to measurements of cadmium at 259 neighbouring locations. 
We also have access to nickel and zinc concentrations at these 259 locations, as well as at the 100 locations 
we wish to predict cadmium. While a standard Gaussian process regression model would only be able to 
make use of the cadmium training measurements, a multi-task method can use the correlated nickel and 
zinc measurements to enhance prediction^ With GPRN we can also make use of how the correlations 
between nickel, zinc, and cadmium change with location to further enhance predictions. 

Here the network structure with by far the highest marginal likelihood has q — 2 latent node functions. 
The node and weight functions learnt using VB for this setting are shown in Figure [2] Since there are 

*This can be seen as a multivariate missing data problem, with p = 3 outputs. 
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Figure 2: Network structure for the Jura dataset learnt by GPRN. The layout here is as in Figure [T] with the 
spatially varying node and weight functions shown, along with the predictive means for the observations. The three 
output dimensions are cadmium, nickel and zinc concentrations respectively. 



p — 3 output dimensions, the result q < p suggests that heavy metal concentrations in the Swiss Jura are 
correlated. Indeed, using our model we can observe the spatially varying correlations between heavy metal 
concentrations, as shown for cadmium and zinc in Figure [3j Although the correlation between cadmium 
and zinc is generally positive (with values around 0.6), there is a region where the correlations drop of 
noticeably, perhaps corresponding to a geological structure. The quantitative results in Table 1 confirm 
that the ability of GPRN to learn these spatially varying correlations is beneficial in terms of being able 
to predict cadmium concentrations. 

We assess performance quantitatively using mean absolute error (MAE) between the predicted and 
true cadmium concentrations. We restart the experiment 10 times with different initialisations of the 
parameters, and average the MAE. The results are marked by JURA in Table [T] This experiment follows 
Goovaerts] (|1997[) and [Alvarez and Lawren"ce| ( |2011[ ). The resuhs for SL FM, ICM and CM OGP are from 
Alvarez and Lawrence (2011), and the results for co-kriging are from Goovaerts (1997). It is unclear 



what preprocessing was performed for these methods, but we found log transforming and normalising each 
dimension to have zero mean and unit variance to be beneficial due to the skewed distribution of the 
y- values (but we also include results on untransformed data, marked with *). All of the multiple output 
methods give lower MAE than using an independent GP, and GPRN outperforms SLFM and the other 
multiple output methods. 

For the JURA dataset, the improved performance of GPRN is at the cost of a slightly greater runtime. 
However, GPRN is accounting for input dependent signal and noise correlations, unlike the other methods. 
Moreover, the complexity of GPRN scales with p as 0{Nqp), unlike the other methods which scale as 
0{N^p^) (Alvarez and Lawrence 2011 ). This is why GPRN runs relatively quickly on the 1000 dimensional 



gene expression dataset, for which the other methods are intractable. This data is available from http: 
//www. ai-geostats . orgT] 



4.3 Multivariate Volatility 

In the previous experiments the GPRN implicitly accounted for multivariate volatility (input dependent 
noise covariance) in making predictions of y(a;*). The GPRN incorporates a generalised Wishart process 



(Wilson and Ghahramani 2010b 2011) noise model into a more general model which can also account 
for signal correlations and other nonstationarities. Although our focus is the ability of GPRN to model 
input dependent correlations in a multiple output regression setting, we here test the GPRN explicitly as 
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Figure 3: Spatially dependent correlation between cadmium and zinc learnt by the GPRN. Markers show the 
locations where measurements were made. 



a model of multivariate volatility, and assess predictions of Y,{t) = cov[y{t)], where the observations y are 
time dependent. We make historical predictions at observed time points, and one day ahead forecasts. 
Historical predictions can be used, for example, to understand a past financial crisis. We follow |Wilson| 



and Ghahramani (2010b) exactly, and predict for returns on three currency exchanges (EXCHANGE) 
and five equity indices (EQUITY) processed exactly as in Wilson and Ghahramani (2010b I. These datasets 
are especially suited to MGARCH, the most popular multivariate volatility model, and have become a 



benchmark for assessing GARCH models (Poon and Granger 


2005 


et al.| 


2009 


McCuUough and Renfro 


1998 


Brooks et al. 


20011. A 



2005 Brownlees 



T,(x) and 200 one step ahead forecasts. The forecasts are assessed using the log likelihood of the new 
observations under the predicted covariance, denoted £ Forecast. We compare to full BEKK MGARCH 
( [Engle and Kroner 19951, the generalised Wishart process (Wilson and Ghahramani 2010b), the original 



Wishart process (Bru 1991 Gourieroux et al. 2009), and using the empirical covariance of the training 
set. We see in Table [ij that GPRN (VB) is competitive with MGARCH, even though these datasets are 
particularly suited to MGARCH. The Historical MSB for EXCHANGE is between the learnt covariance T,{x) 
and y{x)y{x)^ , so the high MSE values for GPRN on EXCHANGE are essentially training error, and are 
less meaningful than the encouraging step ahead forecast likelihoods. The historical predictions are more 
relevant in EQUITY, where we can compare to the true covariances. See Wilson and Ghahramani (2010b) 
for details. 



GPRN and GWP are both highly flexible but fully Bayesian models for multivariate volatility, so it 
understandable that their performance is comparable. While GPRN (MCMC) sometimes outperforms 
MGARCH, the GWP, and the WP, it is often outperformed by GPRN (VB) on the multivariate volatility, 
perhaps suggesting convergence problems. These data were obtained using Bloomberg (http://www .] 
bloomberg . com/ ) . 
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Table 1: Comparative performance on all datasets. 



VjEjIMIL \ -)\ILJ J 






SET 1: 






GPRN (VB) 


0.3356 ± 0.0294 


—0.5945 ± 0.0536 


GPRN (MCMC) 


0.3236 ± 0.0311 


-0.5523 ± 0.0478 


LMC 


0.6909 ± 0.0294 


-0.2687 ± 0.0594 




0.4859 ± 0.0387 


—0.3617 ± 0.0511 


SLFM 


0.6435 ± 0.0657 


-0.2376 ± 0.0456 


SET 2: 






GPRN fVBl 


3403 ± 0339 


—0 6142 ± 0557 


GPRN fMGMGt 


H2f\R + 0.^21 


—0 568.3 + 0,542 


LMC 


6194 ± 0447 


—0 2360 ± 0696 


CMOGP 


4615 ± 0626 


—0 3811 ± 0748 


SLFM 


0.6264 ± 0.0610 


-0.2528 ± 0.0453 


GENE (lOOOD) 


Average SMSE 


Average MSLL 


GPRN (VB) 


0.3473 ± 0.0062 


-0.6209 ± 0.0085 


GPRN (MCMC) 


0.4520 ± 0.0079 


-0.4712 ±0.0327 


Mr 1 i U 


0.5409 ± U.Oizo 


— 0.31z4 ± U.0/00 


MPITC 


5537 ± 0136 


—0 3162 ± 0206 


MDTC 


0.5421 ± 0.0085 


-0.2493 ± 0.0183 


JURA 


Average MAE 


Training Time (sees) 


GPRN (VB) 


0.4040 ± 0.0006 


3781 


GPRN* (VB) 


0.4525 ± 0.0036 


4560 


SLFM (VB) 


0.4247 ± 0.0004 


1643 


OL/J? iVl ( V 13 1 




ioOU 


SLFM 


0.4578 ± 0.0025 


792 


Co-kriging 


0.51 




ICM 


0.4608 ±0.0025 


507 


CMOGP 


0.4552 ± 0.0013 


784 


GP 


0.5739 ± 0.0003 


74 


EXCHANGE 


Historical MSE 


£, Forecast 


GPRN (VB) 


3.83 X 10"^ 


2073 


GPRN (MCMC) 


6.120 X 10"^ 


2012 


GWP 


3.88 X 10-^ 


2020 


WP 


3.88 X 10-^ 


1950 


MGARCH 


3.96 X 10"^ 


2050 


Empirical 


4.14 X 10~^ 


2006 


EQUITY 


Historical MSE 


jC Forecast 


GPRN (VB) 


0.978 X 10"^ 


2740 


GPRN (MCMC) 


0.827 X 10-^ 


2630 


GWP 


2.80 X 10"^ 


2930 


WP 


3.96 X 10"'' 


1710 


MGARCH 


6.69 X 10"^ 


2760 


Empirical 


7.57 X 10"^ 


2370 
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5 Discussion 



A Gaussian process regression network (GPRN) has a simple and interpretable structure, and generalises 
many of the recent extensions to the Gaussian process regression framework. The model naturally accom- 
modates input dependent signal and noise correlations between multiple output variables, heavy tailed 
predictive distributions, input dependent length-scales and amplitudes, and adaptive covariance functions. 
Furthermore, GPRN has scalable inference procedures, and strong empirical performance on several bench- 
mark datasets. 

In the future, it would be enlightening to use GPRN with different types of adaptive covariance struc- 
tures, particularly in the case where p — 1 and q > 1; in one dimensional output space it would be easy, 
for instance, to visualise a process gradually switching between brownian motion, periodic, and smooth 
covariance functions. It would also be interesting to apply this adaptive network to classification. We hope 
the GPRN will inspire further research into adaptive networks, and further connections between different 
areas of machine learning and statistics. 
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7 Appendix 

7.1 Gaussian processes 

Since we extend the Gaussian process framework, we briefly review Gaussian process regression, some 
notation, and expand on some of the points in the introduction. For more detail see [Rasmussen and] 



Williams (2006) 



A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian 
distribution. Using a Gaussian process, we can define a distribution over functions w{x): 

w{x) ^ g'P{rn{x),k{x,x')), (12) 

where w is the output variable, x is an arbitrary (potentially vector valued) input variable, and the mean 
m{x) and covariance function (or kernel) k(x, x') are respectively defined as 

m{x) ^¥.[w(x)], (13) 
k{x, x) — cov[it;(a;), w{x')] . (14) 

This means that any collection of function values has a joint Gaussian distribution: 

{w{xi),w{x2), . . . , w{xN)y ^ Af(p, K) , (15) 

where the N x N covariance matrix K has entries Kij — kixi, Xj), and the mean /x has entries /x^ — m{xi). 
The properties of these functions (smoothness, periodicity, etc.) are determined by the kernel function. 
The squared exponential kernel is popular: 

ksE.{x,x') ^exp(-0.5||a;-x'||V^^)- (16) 

Functions drawn from a Gaussian process with this kernel function are smooth, and can display long 
range trends. The length-scale hyperparameter I is easy to interpret: it determines how much the function 
values w{x) and w{x + a) depend on one another, for some constant a Cz X. When the length-scale is 
learned from data, it is useful for determining how far into the past one should look in order to make good 
forecasts. A G K is the amplitude coefficient, which determines the marginal variance of w{x) in the prior, 
Var[u'(a;)] = A, and the magnitude of covariances between w{x) at different inputs x. 
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The Ornstcin-Uhlcnbeck kernel is also widely applied: 



fcou(a^, x') = exp(— — . 



(17) 



In one dimension it is the covariance function of an Ornstein-Uhlcnbeck process ( Uhlenback and Ornstein 



19301, which was introduced to model the velocity of a particle undergoing Brownian motion. With this 
kernel, the corresponding GP is a continuous time AR(1) process with Markovian dynamics: w{x + a) is 
independent of w(x — a) given wi^x) for any constant a. Indeed the OU kernel belongs to a more general 
class of Matern kernels, 



^Matern 



{x,x') 



'2a\\x - 



r(«) 



I 



^\\x- 



(18) 



where Ka is a modified Bessel function ( Abramowitz and Stegun 



a 



1/2 



1964 ). In one dimension the correspond- 
jThe OU kernel is recovered by setting 



ing GP is a continuous time AR(p) process, where p 
a = 1/2. 

There are many other useful kernels, like the periodic kernel (with a period that can be learned from 
data), or the Gibbs kernel (Gibbs 19971 which allows for input dependent length-scales. Kernels can 
be combined together, e.g. k — aiki + 02^2 + 03^3, and the relative importance of each kernel can be 
determined from data (e.g. from estimating oi , 02 , 03) . Rasmussen and Williams ( 2006 1 and Bishop ( |2006 1 



have a discussion about how to create and combine kernels. 

Suppose we are doing a regression using points {y{xi), . . . , y{xN)} from a noisy function y = w{x) + e, 
where e is additive i.i.d Gaussian noise, such that e ^ Af{0,a^). Letting y = {y{xi), . . . ,y{xN))~^ , and 
w = (w(xi), . . . ,w{xn)^ , we have p{y\w) = JV{w,afj) and p{w) = jV{fi,K) as above. For notational 
simplicity, we assume /x = 0. For a test point w(x.^,), the joint distribution p('w{x.^,), y) is Gaussian: 



w{x^,) 
w 



■AA(0, 



K 



(19) 



where K is defined as above, and (fc*)i — k(x^, Xi) with i = 1, . . . , A'^. We can therefore condition on y to 
find p{w{x^,)\y) =A/'(^*,w*) where 



^Ji,=kl{K + alI)-'y, 
k{x^,x^) - kJ{K + aliy^k^ . 



(20) 
(21) 



We can find this more laboriously by noting that p{w\y) and p{w{x^,)\w) are Gaussian and integrating, 
since p{w{x^)\y) = J p{'w{x^,)\w)p{w\y)dw. 



We see that (21) doesn't depend on the data y, just on how far away the test point a;* is from the 
training inputs {xi, . . . , xn}- 

In regards to the introduction, we also see that for this standard Gaussian process regression, the 



observation model p(y\w) is Gaussian, the predictive distribution in ( 20 ) and ( 21 1 is Gaussian, the marginals 
in the prior (from marginalising equation (15)) are Gaussian, the noise is constant, and in the popular 
covariance functions given, the amplitude and length-scale are constant. A brief discussion of multiple 
outputs, noise models with dependencies, and non-Gaussian observation models can be found in sections 



9.1, 9.2 and 9.3 on pages 190-191 of [Rasmussen and Williams (2006), available free online at the book 
website www . gau ss isinprocess . org/ gpnil| An example of an input dependent length-scale is in section 
4.2 on page 43. 



7.2 Constraining W 

It is possible to reduce the number of modes in the posterior by somehow constraining the weights W 
to be positive. For MCMC it is straightforward to do this by exponentiating the weights, as in |Adams| 

^Discrete time autoregressive processes such as w{t -|- 1) = w{t) + where e{t) ~ M{0, 1), are widely used in time series 
modelhng and are a particularly simple special case of Gaussian processes. 



and Stegle 


(2008 


) and 


Adams et al. 


(2010 



weights to be positive using a truncated Gaussian representation. We found that these extensions did 
not significantly improve empirical performance, although exponentiating the weights sometimes improved 



numerical stability for MCMC on the multivariate volatility experiments. For Adams and Stegle (2008) 
exponentiating the weights will have been more valuable because they use Expectation Propagation which 
is known to perform badly in the presence of multimodality. MCMC and VB approaches are more robust 
to this problem. 

7.3 VB M-step 



From ( 11 ) we have: 



{\ogN{i,;Q,a,Kf.)), 

= - f log a, - \ log \Kj^ I - ^ (a-') ^^J ^j'^j) 



We will need the gradient with respect to 6'/: 

dOf 



The expectations here are straightforward to compute analytically. 



7.4 VB predictive distributions 

The predictive distribution is calculated as 



v{y*{x)\V) - y v{y*{x)\W{x)J(x))v{W{x)J{x)\V)dWdf (22) 

VB fits the approximation j){W(x)^ ]{x)\D) — (?(M^)(i'(/), so the approximate predictive is 

v{y*{x)\V) = y v{y\x)\W{x)J{x))qm^{!)dWdf (23) 

We can calculate the mean and covariance of this distribution analytically: 

r(2:). = 5]E(W,fc)E[/fc] (24) 

k 

cov(y*(a;)),, ^ ^[E(Ty,fe)]E(W^jfc)var(/fc) + %(var(M^,fe)E(/2))] + (25) 

k 

It is also of interest to calculate the noise covariance. Recall our model can be written as 

y{x) :^W{x)f{x)+afW{x)e + ayZ (26) 

signal noise 

Let n = afW{x)e + (jyZ be the noise. The covariance of n is then 

cov(n)y = ^[E[a^JE(T^,fe)E(l^,fe) + 5,,Y&T{Wjk)] + S.^al (27) 
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